{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### You can also run the notebook in [COLAB](https://colab.research.google.com/github/deepmipt/DeepPavlov/blob/master/examples/super_convergence_tutorial.ipynb)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip3 install deeppavlov"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Super Convergence in DeepPavlov"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In [the paper by Leslie N. Smith, Nicholay Topin](https://arxiv.org/abs/1708.07120) authors introduced a phenomenon called \"super-convergence\", where \n",
    "  * <font color='green'>neural networks can be trained</font> an order of magnitude <font color='green'>faster</font> than with standard training methods,\n",
    "  * there is <font color='green'>a greater boost in performance</font> relative to standard training <font color='green'>when the amount of labeled training data is limited</font>."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tutorial Plan:\n",
    "\n",
    "0. [What is Super Convergence?](#0.-What-is-Super-Convergence?)\n",
    "1. [DeepPavlov learning rate schedules](#1.-Learning-rate-schedules)\n",
    "     * [LRScheduledTFModel](#LRScheduledTFModel) [[source]](https://github.com/deepmipt/DeepPavlov/blob/master/deeppavlov/core/models/lr_scheduled_tf_model.py)\n",
    "     * [DecayType.NO](#DecayType.NO)\n",
    "     * [DecayType.LINEAR](#DecayType/LINEAR)\n",
    "     * [DecayType.COSINE](#DecayType/COSINE)\n",
    "     * [DecayType.EXPONENTIAL](#DecayType.EXPONENTIAL)\n",
    "     * [DecayType.POLYNOMIAL](#DecayType.POLYNOMIAL)\n",
    "     * [DecayType.ONECYCLE](#DecayType.ONECYCLE)\n",
    "     * [DecayType.TRAPEZOID](#DecayType.TRAPEZOID)\n",
    "     \n",
    "\n",
    "2. [DeepPavlov learning rate search](#2.-Optimal-learning-rate-search)\n",
    "3. [DeepPavlov Super Convergence](#3.-Super-Convergence)\n",
    "\n",
    "### Useful materials\n",
    "   * Original Super Convergence Paper [\"Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates\" by Leslie N. Smith, Nicholay Topin](https://arxiv.org/abs/1708.07120)\n",
    "   * Post by Sylvian Gugger on [\"How do you find an optimal learning rate\"](https://sgugger.github.io/how-do-you-find-a-good-learning-rate.html)\n",
    "   * [1cycle policy overview](https://sgugger.github.io/the-1cycle-policy.html#the-1cycle-policy)\n",
    "   * Post by fast.ai with results on CIFAR10, [\"Training Imagenet in 3 hours for 25dollars; and CIFAR10 for 0.26dollars\"](https://www.fast.ai/2018/04/30/dawnbench-fastai/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 0. What is Super Convergence?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The simplest explanation of what it is:\n",
    "  - method that helps to train complex neural models faster.\n",
    "\n",
    "As an example, see how it allows to train a resnet-56 on cifar10 to the same or a better precision than the authors in their original paper but with far less iterations.\n",
    "\n",
    "By training with high learning rates you can reach a model that gets 93% accuracy in 70 epochs which is less than 7k iterations (as opposed to the 64k iterations which made roughly 360 epochs in the original paper)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![cs_loss_comparison.png](img/sc_loss_comparison.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of the key elements of super-convergence is training with one learning rate cycle and a large maximum learning rate. A primary insight that allows super-convergence training is that large learning rates regularize the training, hence requiring a reduction of all other forms of regularization in order to preserve an optimal regularization balance.\n",
    "\n",
    "Experiments demonstrate super-convergence for Cifar-10/100, MNIST and Imagenet datasets, and resnet, wide-resnet, densenet, and inception architectures."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Learning rate schedules"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### LRScheduledTFModel"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`class LRScheduledTFModel` [[source]](https://github.com/deepmipt/DeepPavlov/blob/master/deeppavlov/core/models/lr_scheduled_tf_model.py):\n",
    "  * initializes optimizer\n",
    "  * updates learning rate and momentum according to a schedule configured in config\n",
    "  * can search for an optimal learning rate\n",
    "  \n",
    "That means that your model doesn't need to handle learning rate and momentum placeholders and initialize optimizer. Just inherit your class from `LRScheduledMOdel`:\n",
    "\n",
    "```python\n",
    "from deeppavlov.core.models.lr_scheduled_tf_model import LRScheduledTFModel\n",
    "\n",
    "class MyModel(LRScheduledTFModel):\n",
    "```\n",
    "\n",
    "Examples of wrapped in `LRScheduledTFModel` models are:\n",
    "   * Goal-Oriented Bot [[source]](https://github.com/deepmipt/DeepPavlov/blob/master/deeppavlov/models/go_bot/network.py) [[configs]](https://github.com/deepmipt/DeepPavlov/tree/master/deeppavlov/configs/go_bot)\n",
    "   * Named Entity recognizer [[source]](https://github.com/deepmipt/DeepPavlov/blob/master/deeppavlov/models/ner/network.py) [[configs]](https://github.com/deepmipt/DeepPavlov/tree/master/deeppavlov/configs/ner)\n",
    "   * SQUAD model [[source]](https://github.com/deepmipt/DeepPavlov/blob/master/deeppavlov/models/squad/squad.py) [[configs]](https://github.com/deepmipt/DeepPavlov/tree/master/deeppavlov/configs/squad)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### LRScheduledKerasModel "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`class LRScheduledKerasModel` [[source]](https://github.com/deepmipt/DeepPavlov/blob/master/deeppavlov/core/models/keras_model.py):\n",
    "  * updates learning rate and momentum according to a schedule configured in config\n",
    "  * can search for an optimal learning rate\n",
    "  \n",
    "That means that your model doesn't need to handle learning rate and momentum placeholders, just need to initialize optimizer and compile model. Just inherit your class from `LRScheduledKerasModel`:\n",
    "\n",
    "```python\n",
    "from deeppavlov.core.models.keras_model import LRScheduledKerasModel\n",
    "\n",
    "class MyModel(LRScheduledKerasModel):\n",
    "```\n",
    "\n",
    "Example of wrapped in `LRScheduledKerasModel` models is:\n",
    "   * Keras classification model [[source]](https://github.com/deepmipt/DeepPavlov/blob/master/deeppavlov/models/classifiers/keras_classification_model.py) [[configs]](https://github.com/deepmipt/DeepPavlov/tree/master/deeppavlov/configs/classifiers/rusentiment_bigru_superconv.json)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Optimizer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can set optimizer by:\n",
    "\n",
    "```json\n",
    "{\n",
    "    \"class_name\": \"my_model\",\n",
    "    ...\n",
    "    \"optimizer\": \"tf.train:AdadeltaOptimizer\"\n",
    "}\n",
    "```\n",
    "\n",
    "If no `optimizer` is mentioned then `tf.train:AdamOptimizer` will be used."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### DecayType.NO"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```json\n",
    "{\n",
    "    \"class_name\": \"my_model\",\n",
    "    ...\n",
    "    \"learning_rate\": 0.1,\n",
    "    \"learning_rate_decay\": \"no\"\n",
    "}\n",
    "```\n",
    "\n",
    "or just\n",
    "\n",
    "```json\n",
    "{\n",
    "    \"class_name\": \"my_model\",\n",
    "    ...\n",
    "    \"learning_rate\": 0.1\n",
    "}\n",
    "```\n",
    "\n",
    "corresponds to the following learning rate update schedule:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![cs_ner_lr_no.png](img/sc_ner_lr_no.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### DecayType.LINEAR"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```json\n",
    "{\n",
    "    \"class_name\": \"my_model\",\n",
    "    ...\n",
    "    \"learning_rate\": [0.01, 0.1]\n",
    "    \"learning_rate_decay\": \"linear\",\n",
    "    \"learning_rate_decay_batches\": 800\n",
    "}\n",
    "```\n",
    "corresponds to :"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![cs_ner_lr_linear.png](img/sc_ner_lr_linear.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or reverse `learning_rate` parameter to go from larger learning rate to smaller: \n",
    "\n",
    "```json\n",
    "{\n",
    "    \"class_name\": \"my_model\",\n",
    "    ...\n",
    "    \"learning_rate\": [0.1, 0.01]\n",
    "    \"learning_rate_decay\": \"linear\",\n",
    "    \"learning_rate_decay_batches\": 800\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![cs_ner_lr_linear2.png](img/sc_ner_lr_linear2.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### DecayType.COSINE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```json\n",
    "{\n",
    "    \"class_name\": \"my_model\",\n",
    "    ...\n",
    "    \"learning_rate\": [0.1, 0.01]\n",
    "    \"learning_rate_decay\": \"cosine\",\n",
    "    \"learning_rate_decay_batches\": 800\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![cs_ner_lr_cosine.png](img/sc_ner_lr_cosine.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### DecayType.EXPONENTIAL"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```json\n",
    "{\n",
    "    \"class_name\": \"my_model\",\n",
    "    ...\n",
    "    \"learning_rate\": [0.1, 0.01]\n",
    "    \"learning_rate_decay\": \"exponential\",\n",
    "    \"learning_rate_decay_batches\": 800\n",
    "}\n",
    "```\n",
    "corresponds to :"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![cs_ner_lr_exponential.png](img/sc_ner_lr_exponential.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### DecayType.POLYNOMIAL"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```json\n",
    "{\n",
    "    \"class_name\": \"my_model\",\n",
    "    ...\n",
    "    \"learning_rate\": [0.1, 0.01]\n",
    "    \"learning_rate_decay\": [\"polynomial\", 1.0],\n",
    "    \"learning_rate_decay_batches\": 800\n",
    "}\n",
    "```\n",
    "corresponds to:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![cs_ner_lr_polynomial.png](img/sc_ner_lr_polynomial.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Polynomial decay has a parameter of \"decay power\" (which was equal to `1.0` preciously).\n",
    "\n",
    "Let's try decay power value of `0.1`:\n",
    "\n",
    "```json\n",
    "{\n",
    "    \"class_name\": \"my_model\",\n",
    "    ...\n",
    "    \"learning_rate\": [0.1, 0.01]\n",
    "    \"learning_rate_decay\": [\"polynomial\", 0.1],\n",
    "    \"learning_rate_decay_batches\": 800\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![cs_ner_lr_polynomial1.png](img/sc_ner_lr_polynomial1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And decay power value of `10`:\n",
    "\n",
    "```json\n",
    "{\n",
    "    \"class_name\": \"my_model\",\n",
    "    ...\n",
    "    \"learning_rate\": [0.1, 0.01]\n",
    "    \"learning_rate_decay\": [\"polynomial\", 10],\n",
    "    \"learning_rate_decay_batches\": 800\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![cs_ner_lr_polynomial2.png](img/sc_ner_lr_polynomial2.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### DecayType.ONECYCLE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```json\n",
    "{\n",
    "    \"class_name\": \"my_model\",\n",
    "    ...\n",
    "    \"learning_rate\": [0.01, 0.1]\n",
    "    \"learning_rate_decay\": \"onecycle\",\n",
    "    \"learning_rate_decay_batches\": 800\n",
    "}\n",
    "```\n",
    "corresponds to :"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![cs_ner_lr_onecycle.png](img/sc_ner_lr_onecycle.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### DecayType.TRAPEZOID"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```json\n",
    "{\n",
    "    \"class_name\": \"my_model\",\n",
    "    ...\n",
    "    \"learning_rate\": [0.01, 0.1]\n",
    "    \"learning_rate_decay\": \"trapezoid\",\n",
    "    \"learning_rate_decay_batches\": 800\n",
    "}\n",
    "```\n",
    "corresponds to :"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![cs_ner_lr_trapezoid.png](img/sc_ner_lr_trapezoid.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Optimal learning rate search"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also tune learning rate on data before training.\n",
    "\n",
    "Add `fit_on` and `fit_batch_size` in your component along with desired `learning_rate_decay` (+`learning_rate_decay_batches`), and `learning_rate` parameter will be set automatically.\n",
    "\n",
    "For example,\n",
    "\n",
    "```json\n",
    "{\n",
    "    \"class_name\": \"my_model\",\n",
    "    ...\n",
    "    \"learning_rate_decay\": \"trapezoid\",\n",
    "    \"learning_rate_decay_batches\": 800,\n",
    "    \n",
    "    \"fit_batch_size\": 16,\n",
    "    \"fit_on\": [\"x0\", \"x1\", \"x2\", \"y\"]\n",
    "}\n",
    "```\n",
    "\n",
    "will find an optimal `learning_rate` for your trapezoid update schedule.\n",
    "\n",
    "`DecayType.NO`, `DecayType.LINEAR`, `DecayType.POLYNOMIAL`, `DecayType.EXPONENTIAL`, `DecayType.ONECYCLE`, `DecayType.TRAPEZOID` are all supported in learning rate search mode."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Super Convergence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Super Convergence is then equivalent to the following config parameters:\n",
    "\n",
    "```json\n",
    "{\n",
    "    \"class_name\": \"my_model\",\n",
    "    ...\n",
    "    \"learning_rate_decay\": \"onecycle\",\n",
    "    \"learning_rate_decay_batches\": 1000, #hyperparameter\n",
    "    \n",
    "    \"fit_batch_size\": 16, #hyperparameter\n",
    "    \"fit_on\": [\"x0\", \"x1\", \"x2\", \"y\"],\n",
    "    \n",
    "    \"momentum\": [0.95, 0.85],\n",
    "    \"momentum_decay\": \"onecycle\",\n",
    "    \"momentum_decay_batches\": 1000 #hyperparameter\n",
    "}\n",
    "```\n",
    "for any optimizer. Which will result in similar to the following learning rate and momentum update schedules:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![cs_ner_lr_sc.png](img/sc_ner_lr_sc.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For `tf.train:AdamOptimizer` is it recommended to use trapezoid update schedule:\n",
    "\n",
    "```json\n",
    "{\n",
    "    \"class_name\": \"my_model\",\n",
    "    ...\n",
    "    \"optimizer\": \"tf.train:AdamOptimizer\",\n",
    "    \"learning_rate_decay\": \"trapezoid\",\n",
    "    \"learning_rate_decay_batches\": 1000, #hyperparameter\n",
    "    \n",
    "    \"fit_batch_size\": 16, #hyperparameter\n",
    "    \"fit_on\": [\"x0\", \"x1\", \"x2\", \"y\"],\n",
    "    \n",
    "    \"momentum\": [0.95, 0.85],\n",
    "    \"momentum_decay\": \"trapezoid\",\n",
    "    \"momentum_decay_batches\": 1000 #hyperparameter\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![cs_ner_lr_sc1.png](img/sc_ner_lr_sc1.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "accelerator": "GPU",
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autoclose": false,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
